SeleScan: Guide

By Pavel Salazar-Fernandez (epsalazarf@gmail.com), in collaboration with Marc Pybus (marcpybus@gmail.com) and Andres Moreno-Estrada (andres.moreno@cinvestav.mx)

Laboratory of Human Population and Evolutionary Genomics (LANGEBIO) & Institut de Biologia Evolutiva (PRBB)

About

This document explains how to set up a selection scan using this repository. Phased VCF files are split into chromosomes and summary statistics are calculated with the R packages PopGenome and rEHH. The results are compiled into a table and then compared to results derived from neutral simulated data to identify outlier regions that may have been affected by selective pressure.

Overview

  1. Split phased VCF file into chromosomes.
  2. Obtain summary statistics with PopGenome: neutrality, Fst, SFS.
  3. Obtain haplotype statistics with eRHH: iHS, iHH.
  4. Generate neutral simulations with ms.
  5. Obtain summary and haplotype statistics (steps 2 and 3) for the simulations.
  6. Merge all results in a summary table.
  7. Scan table for regions with significant stats suggesting selection.

Requirements

Software:

Simulations

Recommended reading:

Hoban, S., Bertorelle, G., & Gaggiotti, O. E. (2011). Computer simulations: tools for population and evolutionary genetics. Nat Rev Genet, 13(2), 110–122. http://doi.org/10.1038/nrg3130

MS

OmicsTools: ms

A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model. The program assumes an infinite-sites model of mutation, and allows recombination, gene conversion, symmetric migration among subpopulations, and a variety of demographic histories. The samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.

Related software:

Usage:
Basic:
ms nsam nreps -t θ > output.ms

Advanced:
ms nsam nreps -t θ -r ρ -I POPS N1 N2 -ej t i j -en t i x > output.ms

Output:
A text file containing all the generated haplotypes generated. An example output is:

./ms 3 3 -t 3.0
56917 16239 12250

//
segsites: 2
positions: 0.4493 0.7948
01
00
10

//
segsites: 4
positions: 0.2499 0.4466 0.6762 0.8853
0100
0011
1000

//
segsites: 0

Where:

Population Genomics Software

PopGenome

CRAN: PopGenome

PopGenome is a new package for population genomic analyses and method development. PopGenome includes, e.g., a wide range of polymorphism, neutrality statistics, and FST estimates; these can be applied to sequence data stored in alignment format, as well as to whole genome SNP data, e.g., from the 1000/1001 Genome projects. The full range of methods can be applied to whole alignments, sets of sub-sequences, and sliding windows based on either nucleotide positions or on SNP counts. PopGenome is also able to handle GFF/GTF annotation files and automatically specifies the SNPs located in, e.g., exon or intron regions. Those subsites can be analyzed together (e.g., all introns together) or each region seperately (e.g., one value per intron). The PopGenome frame- work is linked to Hudson’s MS and Ewing’s MSMS programs for significance tests using coalescent simulations.

rEHH

CRAN: rEHH

Functions for the detection of footprints of selection on dense SNP data using Extended Homozygosity Haplotype (EHH) based tests. The package includes computation of EHH, iHS (within population) and Rsb (across pairs of populations) statistics. Various plotting functions are also included to facilitate visualization and interpretation of the results.

Selection Scan Pipeline

Directory Tree

Empirical Data Analysis

Input

Note: requirements with * are filtered by split_by_*.sh scripts.

Pipeline

PopGenome
  1. Move or link the VCF file to PopGenome/.
  2. Modify split_by_chr.sh for your data. If needed, add a sample filter.
  3. Run split_by_chr.sh. Individual directories for each chromosome will be created, containing a VCF file for that chromosome.
  4. Run RUN_PopGenome.sh. Results for each chromosome will be written to its corresponding directory.
rEHH
  1. Move or link the VCF file and sample IDs files to rEHH/
  2. Modify split_by_chr.sh for your data, declaring the populations to be analyzed and their respective sample IDs files.
  3. Run split_by_pop_and_chr.sh. Individual directories for each chromosome will be created, containing a VCF file for each population and the corresponding chromosome.
  4. Run RUN_PIPELINE.sh. Results for each population and chromosome will be written to its corresponding directory.

Output

Sample Output
(region) Tajima.D Segregating Sites Rozas.R_2 Fu.Li.F Fu.Li.D Fu.F_S Fay.Wu.H Zeng.E Strobeck.S
### - ### ### ### NA ### ### NA NA* NA* NA
... ... ... ... ... ... ... ... ... ...

NA*: Calculated only if ancestral alleles are provided

(SNP) pop 1 pop 2 pop 3
### ### ### ###
... ... ... ...
(SNP) (Fst)
### ###
... ...
ID CHR POSITION iHS Pvalue
rs### ## ##### ### ###
... ... ... ... ...

Simulated Data Analysis

Input

ms Parameters for ALL

Parameters:

Command:
ms [loci] 1 -t [theta] -r $(( ( RANDOM % [rho_max - rho min + 1]) + rho_min )) [seq length +1] -I 4 [pop1] [pop2] [pop3] [pop4] > output.ms

Pipeline

  1. Setup the parameters in these files:
    • ms_commands_100Kb.sh
    • add_ancestral.pl
    • pop_genome_analysis_simulations.R
    • RUN_PIPELINE_2.sh
    • run_conversion_and_parsing.pl
    • run_conversion_and_parsing2.pl
  2. Run bash RUN_PIPELINE_1.sh (~10 hours)
  3. Run bash RUN_PIPELINE_2.sh (~1 hour)
  4. Run bash RUN_PIPELINE_3.sh (>4 days)
    • Runtime estimation: (Populations * Sequence Length * Simulations) / 4.16e8 = Hours

Output

Sample Output
(region) Tajima.D Segregating Sites Rozas.R_2 Fu.Li.F Fu.Li.D Fu.F_S Fay.Wu.H Zeng.E Strobeck.S
### - ### ### ### NA ### ### NA ### ### NA
... ... ... ... ... ... ... ... ... ...
snp1 snp2 snp3 ...
pop1 ### ### ### ...
pop2 ### ### ### ...
pop3 ### ### ### ...
(SNP) (Fst)
### ###
... ...
ID CHR POSITION iHS Pvalue
chr1_### ## ##### ### ###
... ... ... ... ...

Summary Statistics Compilation

Since all results are splitted to populations and/or chromosomes, an additional run of scripts is necessary to condense all results into genome-wide files.

Summary Statistics: Tables

Requirements

Software

Data
All files produced by correctly-completed analyses from the previous section, not moved nor renamed from their scripted location, and assuming 10,000 simulations.

Procedure

  1. Change location to the base pipeline directory (SelectionScan/)
  2. Set up the populations in the #Parameters section of these files:
    • RUN_CONCAT_EMP.sh
    • RUN_CONCAT_SIM.sh
  3. Run bash RUN_CONCAT_EMP.sh. This will start concatenations for all empirical data (PopGenome and rEHH).
  4. Run bash RUN_CONCAT_SIM.sh. This will start concatenations for all simulated data (PopGenome and rEHH).

Output

The tables generated have the same columns as the input files describe in the Output sections of Selection Scan Pipeline, except for those columns with only NA values.

Summary Statistics: Plots

Requirements

Software

Data
All files produced by correctly-completed concatenations from the previous section, not moved nor renamed from their scripted location.

Procedure

  1. Set up the parameters in the #<INPUT> section of these files:
    • PopGenome/NEUTplot.R
    • PopGenome/SFSplot.R
    • PopGenome/SFSplot.empxsim.R
    • PopGenome/FSTplot.R
    • PopGenome/pairFSTplot.R
    • rEHH/iHSplot.R
    • rEHH/XPEHHplot.R
  2. Run bash RUN_PLOT_ALL.sh. This will generate all plots for the concatenated empirical data.

Output

Plots and tables for Top 1% and Top 0.1% for all stats will be produced.

Sample XP-EHH Output
ID CHR POSITION XPEHH -log10(p-value) [bilateral]
rs### ## ##### ### ###
... ... ... ... ...